Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification
نویسندگان
چکیده
In many languages, sparse availability of resources causes numerous challenges for textual analysis tasks. Text classification is one of such standard tasks that is hindered due to limited availability of label information in lowresource languages. Transferring knowledge (i.e. label information) from high-resource to low-resource languages might improve text classification as compared to the other approaches like machine translation. We introduce BRAVE (Bilingual paRAgraph VEctors), a model to learn bilingual distributed representations (i.e. embeddings) of words without word alignments either from sentencealigned parallel or label-aligned non-parallel document corpora to support cross-language text classification. Empirical analysis shows that classification models trained with our bilingual embeddings outperforms other stateof-the-art systems on three different crosslanguage text classification tasks.
منابع مشابه
Learning Bilingual Sentiment Word Embeddings for Cross-language Sentiment Classification
The sentiment classification performance relies on high-quality sentiment resources. However, these resources are imbalanced in different languages. Cross-language sentiment classification (CLSC) can leverage the rich resources in one language (source language) for sentiment classification in a resource-scarce language (target language). Bilingual embeddings could eliminate the semantic gap bet...
متن کاملA Variational Autoencoding Approach for Inducing Cross-lingual Word Embeddings
Cross-language learning allows one to use training data from one language to build models for another language. Many traditional approaches require word-level alignment sentences from parallel corpora, in this paper we define a general bilingual training objective function requiring sentence level parallel corpus only. We propose a variational autoencoding approach for training bilingual word e...
متن کاملAn Efficient Cross-lingual Model for Sentence Classification Using Convolutional Neural Network
In this paper, we propose a cross-lingual convolutional neural network (CNN) model that is based on word and phrase embeddings learned from unlabeled data in two languages and dependency grammar. Compared to traditional machine translation (MT) based methods for cross lingual sentence modeling, our model is much simpler and does not need parallel corpora or language specific features. We only u...
متن کاملBeyond Bilingual: Multi-sense Word Embeddings using Multilingual Context
Word embeddings, which represent a word as a point in a vector space, have become ubiquitous to several NLP tasks. A recent line of work uses bilingual (two languages) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by...
متن کاملUsing Word Embeddings to Translate Named Entities
In this paper we investigate the usefulness of neural word embeddings in the process of translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel, yet simple way of obtaining bilingual word vectors. Inspired by observations in (Mikolov et al., 2013b), which show that training their word vector model on compara...
متن کامل